Structured Abstract

Context: 

Developers design test suites to automatically verify that
software meets its expected behaviors. Many dynamic analysis
techniques rely on exploiting the execution traces of test
cases. However, in practice, the execution of one manually-written
test case results in only one trace.

Objective: 

In this paper, we propose a new technique of test suite refactoring,
called B-Refactoring. The idea behind B-Refactoring
is to split a test case into small test fragments, each of which
covers a simpler part of the control flow, to provide better support
for dynamic analysis.

Method: 

For a given dynamic analysis technique, our test suite refactoring
approach monitors the execution of test cases and
identifies small test cases without loss of the testing ability.
We apply B-Refactoring to assist two existing analysis tasks:
automatic repair of if-condition bugs and automatic analysis
of exception contracts.

Results: 

Experimental results show that test suite refactoring can effectively
simplify the execution traces of the test suite. Three
real-world bugs that could previously not be fixed with the
original test suite are fixed after applying B-Refactoring;
meanwhile, exception contracts are better verified by applying
B-Refactoring to original test suites.

Conclusions: 

We conclude that applying B-Refactoring can effectively improve
the purity of test cases. Existing dynamic analysis tasks
can be enhanced by test suite refactoring.

I. Introduction 

Developers design and write test suites to automatically
verify that software meets its expected behaviors. For instance,
in regression testing, the role of a test suite is to catch new
bugs (the regressions) after changes. Test suites are
used in a wide range of dynamic analysis techniques: in
fault localization, a test suite is executed for inferring the
location of bugs by reasoning on code coverage; in
invariant discovery, input points in a test suite are used to
infer likely program invariants; in software repair, a test
suite is employed to verify the behavior of synthesized patches
[20]. Many dynamic analysis techniques are based on the
exploitation of execution traces obtained by each test case.

Different types of dynamic analysis techniques require
different types of traces. The accuracy of dynamic analysis
depends on the structure of those traces, such as length,
diversity, redundancy, etc. For example, several traces that
cover the same paths with different input values are very
useful for discovering program invariants; fault localization
benefits from traces that cover different execution paths
and that are triggered by assertions in different test cases [41].
However, in practice, one manually-written test case results in
only one trace during the test suite execution. Moreover,
test suite execution traces can be optimal with respect
to test suite comprehension (from the human viewpoint of the
authors of the test suite) but suboptimal with respect
to other criteria (from the viewpoint of dynamic analysis
techniques).

In this paper, instead of having a single test suite used
for many analysis tasks, our hypothesis is that a system can
automatically optimize the design of a test suite with respect
to the requirements of a given dynamic analysis technique. For
instance, given an original test suite, developers can have one
version optimized with respect to fault localization as well as
another version optimized with respect to automatic software
repair. This optimization can be made on demand for a specific
type of dynamic analysis.

Our approach to test suite refactoring, called B-Refactoring,^1
detects and splits impure test cases. In our work,
an impure test case is a test case that executes a path that a
given dynamic analysis technique cannot process. The idea behind
B-Refactoring is to split a test case into small "test fragments",
where each fragment is a completely valid test case and covers
a simple part of the control flow; the test fragments obtained after
splitting provide better support for dynamic software analysis. The
purified test suite is semantically equivalent to the original one:
it triggers exactly the same set of behaviors as the original test
suite and detects exactly the same bugs. However, it produces
a different set of execution traces, which is better suited
to the targeted dynamic program analysis. Note that our
definition of purity is specific to test cases and is completely
different from the one used in the programming language
literature (e.g., pure and impure functional programming).

^1 B-Refactoring is short for Banana-Refactoring. We name our approach
after the banana because we split a test case in the way a banana is
split in the ice cream dessert named Banana Split.

To evaluate our approach, we consider two dynamic analysis
techniques, one in the domain of automatic software repair
and the other in the context of dynamic verification of
exception contracts. We briefly present the case of software
repair here and will present in detail the dynamic verification
of exception contracts in Section V-B. For software repair, we
consider Nopol, an automatic repair system for bugs in if
conditions. Nopol employs a dynamic analysis technique that
is sensitive to the design of test suites. The efficiency of Nopol
depends on whether the same test case executes both then and
else branches of an if. This forms a purification criterion
that is given as input to our test suite refactoring technique. In
our dataset, we show that purification yields a 66.34% increase
in the number of purely tested ifs (3,746 instead of 2,252)
and makes new bugs fixable with the purified
test suite.
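
The reported increase follows directly from the two counts of purely tested ifs given above; the line below only restates that arithmetic and adds no new data:

$$\frac{3{,}746 - 2{,}252}{2{,}252} = \frac{1{,}494}{2{,}252} \approx 0.6634 = 66.34\%$$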


Prior work. Our prior work [41] shows that the traces produced by an
original test suite are suboptimal with respect to fault localization.
In that work, the original test suite is updated to enhance the usage of
assertions in fault localization. In the current paper, we propose
a generalized way of test suite refactoring, which optimizes
the usage of the whole test suite according to a given dynamic
analysis technique.

This paper makes the following major contributions. 

1. We formulate the problem of automatic test case refactoring
for dynamic analysis. The concept of pure and impure
test cases is generalized to any type of program element.

2. We propose B-Refactoring, an approach for automatically
refactoring test suites according to a specific criterion.
This approach detects and refactors impure test cases based
on analyzing execution traces. The test suite after refactoring
consists of smaller test cases that do not reduce the potential
of bug detection.

3. We apply B-Refactoring to assist two existing dynamic 
analysis tasks from the literature: automatic repair of if- 
condition bugs and automatic analysis of exception contracts. 
Three real-world bugs that could not be fixed with original test 
suites are fixed after test suite refactoring; exception contracts 
are better verified by applying B-Refactoring to original test 
suites. 


The remainder of this paper is organized as follows. In
Section II, we introduce the background and motivation of
test suite refactoring. In Section III, we define the problem of
refactoring test suites and propose our approach B-Refactoring.
In Section IV, we evaluate our approach on five open-source
projects; in Section V, we apply the approach to automatic
repair and exception contract analysis. Section VI discusses
the threats to validity. Section VII lists the related work and
Section VIII concludes our work. The Appendix describes
two case studies of real-world bugs, which are fixed by
applying test suite refactoring.


II. Background and Motivation 

Test suite refactoring can be used in different tasks of
dynamic analysis. In this section, we present one application
scenario, i.e., if-condition bug repair. However, test suite
refactoring is general and goes beyond software repair. Another
application scenario, exception handling, can be found in
Section V-B.


A. Pitfall of Repairing Real-World Bugs 

In test suite based repair, a repair method
generates a patch for potentially buggy statements and then
validates the patch with a given test suite. For example, a
well-known test suite based method, GenProg, employs genetic
programming to generate patch code by updating Abstract
Syntax Tree (AST) nodes in C programs. The generated patch
must pass the whole test suite.


The research community of test suite based repair has produced
fruitful results, such as GenProg by Le Goues et al.,
Par by Kim et al. [18], and SemFix [27]. However, applying
automatic repair to real-world bugs is unexpectedly difficult.


In 2012, a case study by Le Goues et al. showed
that 55 out of 105 bugs can be fixed by GenProg [20]. This
work set a milestone for the real-world application of
test suite based repair techniques. Two years later, Qi et al.
[29] empirically explored the search strategy (genetic programming)
inside GenProg and showed that this strategy does not
perform better than random search. Their proposed approach
based on random search, RSRepair, worked more efficiently
than GenProg. Recent work by Qi et al. has examined
the "plausible" results in the experimental configuration of
GenProg and RSRepair. Their results pointed out that only 2 of
the 55 patches reported for GenProg are actually
correct, and 2 of the 24 patches reported for RSRepair [29]
are correct. All the other reportedly fixed bugs suffer from
problematic experimental issues and meaningless patches.


Repairing real-world bugs is not easy. Test suite based
repair is able to generate a patch that passes the whole test
suite, but this patch may not provide the same functionality
as the real-world patch. In other words, a patch generated by a test
suite based repair technique could be semantically incorrect
compared with a patch manually written by developers.


B. Automatic Software Repair with Nopol 


Test suites used in repairing real-world bugs are worth
investigating. In test suite based repair, a test suite plays a key
role in validating whether a generated patch fixes a bug and behaves
correctly. The quality of test suites impacts the
patch generation in automatic repair. The test suite refactoring
technique addressed in this paper aims to enhance the given
test suite to assist automatic repair (as well as other dynamic
analysis techniques in Section V-B).


To motivate our test suite refactoring in the context of
software repair, we introduce an existing repair approach,
Nopol. Nopol focuses on fixing bugs in if conditions.
To generate a patch for an if condition, Nopol requires at
least one failing test case and one passing test case. To prevent
Nopol from generating a trivial fix (e.g., if (true)), test cases
have to cover both the then branch and the else branch.


However, in practice, one test case may cover both then
and else branches together. This results in ambiguous
behavior for the repair approach, Nopol. In the best case, the
repair approach discards this test case and continues the repair
process; in the worst case, the repair approach cannot fix the
bug because discarding the test case leads to a lack of test
cases. In this paper, the test suite refactoring technique that
we will present enables a repair approach to fix previously-unfixed
bugs.

(a) Buggy program

     1  public double factorialDouble(final int n) {
     2    if (n < 0) {
     3      throw new IllegalArgumentException(
     4        "must have n >= 0 for n!");
     5    }
     6    return Math.floor(Math.exp(factorialLog(n)) + 0.5);
     7  }
     8
     9  public double factorialLog(final int n) {
    10    //PATCH: if (n < 0) {
    11    if (n <= 0) {
    12      throw new IllegalArgumentException(
    13        "must have n > 0 for n!");
    14    }
    15    double logSum = 0;
    16    for (int i = 2; i <= n; i++) {
    17      logSum += Math.log((double) i);
    18    }
    19    return logSum;
    20  }

(b) Two original test cases

     1  public void testFactorial() { //Passing test case
     2
     3    try {
     4      double x = MathUtils.factorialDouble(-1);
     5      fail("expecting IllegalArgumentException");
     6    } catch (IllegalArgumentException ex) {
     7      ;
     8    }
     9    try {
    10      double x = MathUtils.factorialLog(-1);
    11      fail("expecting IllegalArgumentException");
    12    } catch (IllegalArgumentException ex) {
    13      ;
    14    }
    15    assertTrue("expecting infinite factorial value",
    16      Double.isInfinite(MathUtils.factorialDouble(171)));
    17  }
    18  public void testFactorialFail() { //Failing test case
    19
    20    assertEquals("0", 0.0d, MathUtils.factorialLog(0), 1E-14);
    21  }

(c) Three test cases after purification

     1  // The first fragment must execute the setUp code
     2  @TestFragment(origin=testFactorial, order=1)
     3  void testFactorial_fragment_1() {
     4    setUp();
     5    //Lines from 2 to 14 in Fig. 1b executing then branch
     6  }
     7
     8  // Split between Line 14 and Line 15 in Fig. 1b
     9
    10  // The last fragment must execute the tearDown code
    11  @TestFragment(origin=testFactorial, order=2)
    12  void testFactorial_fragment_2() {
    13    //Lines from 15 to 16 in Fig. 1b executing else branch
    14    tearDown();
    15  }
    16
    17  //Already pure test case
    18  @Test
    19  public void testFactorialFail() {
    20    // Executes the then branch
    21  }

Fig. 1: Example of test case purification. The buggy program and test cases are extracted from Apache Commons Math. The
buggy if is at Line 11 of Fig. 1a. A test case testFactorial in Fig. 1b executes both then (at Line 10 of Fig. 1b) and
else (at Line 15 of Fig. 1b) branches of the if (at Line 11 of Fig. 1a). Fig. 1c shows the test cases after the splitting
(between Lines 14 and 15) according to the execution on branches.

We choose Nopol as the automatic repair approach in our
experiment for the following reasons. First, Nopol is developed
by our group and is available as open source.^2 Second, the
target of Nopol is only to fix if-condition bugs; such a target
narrows down the bugs under study and reduces the impact
of different kinds of bugs [16].


C. Real-World Example: Apache Commons Math 

We use a real-world bug in Apache Commons Math to
illustrate the motivation of our work. Apache Commons Math
is a Java library of self-contained mathematics and statistics
components.^3 Fig. 1 shows code snippets from the Apache
Commons Math project. It consists of real-world code of a
program with a bug in an if and two related test cases.^4
The program in Fig. 1a is designed to calculate the factorial
and includes two methods: factorialDouble for the factorial
of a real number and factorialLog for calculating the
natural logarithm of the factorial. The bug lies in the if
condition n<=0 at Line 11, which should actually be n<0.

Fig. 1b displays two test cases that execute the buggy
if condition: a passing one and a failing one. The failing
test case detects that a bug exists in the program while the
passing test case validates the correct behavior. In Fig. 1b,
we can observe that test code before Line 14 in the test
case testFactorial executes the then branch while test
code after Line 15 executes the else branch. Consequently,
Nopol fails to repair this bug because it cannot distinguish the
executed branch (the then branch or the else one).


^2 Nopol Project, https://github.com/SpoonLabs/nopol/

^3 Apache Commons Math, http://commons.apache.org/math/

^4 See https://fisheye6.atlassian.com/changelog/commons?cs=141473


Is there any way to split this test case into two parts
according to the execution on branches? Fig. 1c^5 shows
two test cases after splitting the test case testFactorial
between Lines 14 and 15. Based on the test cases after splitting,
Nopol works well and is able to generate a correct patch as
expected. The test case splitting motivates our work: refining
a test case to cover simpler parts of the control flow during
program execution.
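
To make the ambiguity concrete, the following Java sketch re-creates the buggy factorialLog of Fig. 1a with a hand-written branch recorder and compares the trace of the whole test case with the traces of its two fragments. The class BranchTraceDemo, the enum Branch, and the recording calls are hypothetical instrumentation written for this illustration; they are not the monitoring code of Nopol or B-Refactoring.

```java
import java.util.EnumSet;
import java.util.Set;

final class BranchTraceDemo {
    enum Branch { THEN, ELSE }

    // Hypothetical recorder: branches of the buggy if taken since the last clear().
    static final Set<Branch> trace = EnumSet.noneOf(Branch.class);

    // Buggy factorialLog from Fig. 1a (n <= 0 instead of n < 0), with recording added.
    static double factorialLog(int n) {
        if (n <= 0) {
            trace.add(Branch.THEN);
            throw new IllegalArgumentException("must have n > 0 for n!");
        }
        trace.add(Branch.ELSE);
        double logSum = 0;
        for (int i = 2; i <= n; i++) {
            logSum += Math.log((double) i);
        }
        return logSum;
    }

    static double factorialDouble(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("must have n >= 0 for n!");
        }
        return Math.floor(Math.exp(factorialLog(n)) + 0.5);
    }

    // Lines 2 to 14 of Fig. 1b: both calls are expected to throw.
    static void thenPart() {
        try { factorialDouble(-1); } catch (IllegalArgumentException expected) { }
        try { factorialLog(-1); } catch (IllegalArgumentException expected) { }
    }

    // Lines 15 to 16 of Fig. 1b: returns whether the factorial overflowed to infinity.
    static boolean elsePart() {
        return Double.isInfinite(factorialDouble(171));
    }

    // Trace of the original, unsplit test case.
    static Set<Branch> wholeTest() {
        trace.clear();
        thenPart();
        elsePart();
        return EnumSet.copyOf(trace);
    }

    // Trace of the first fragment alone.
    static Set<Branch> fragment1() {
        trace.clear();
        thenPart();
        return EnumSet.copyOf(trace);
    }

    // Trace of the second fragment alone.
    static Set<Branch> fragment2() {
        trace.clear();
        elsePart();
        return EnumSet.copyOf(trace);
    }
}
```

In this sketch, the unsplit test case records both branches of the if, whereas each fragment records exactly one, which is the property a branch-sensitive repair approach needs.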

Test suite purification can be applied prior to different
dynamic analysis techniques and not only to software repair.
Section V-B presents another application scenario, i.e., exception
contract analysis.


III. Test Suite Refactoring

In this section, we present basic concepts of test suite 
refactoring, our proposed approach, and important technical 
aspects. 


A. Basic Concepts 

In this paper, a program element denotes an entity in the
code of a program, as opposed to a test constituent, which
denotes an entity in the code of a test case. We use the terms
element and constituent to make it always clear whether
we refer to the application program or its test suite. Any node
in an Abstract Syntax Tree (AST) of the program (resp. the
test suite) can be considered as a program element (resp. a test
constituent). For example, an if element and a try element
denote an if statement and a try statement in Java, respectively.


^5 Note that in Fig. 1c, the first two test cases after splitting have extra
annotations like @TestFragment at Line 2 as well as extra code like setUp
at Line 4 and tearDown at Line 14. We add these lines to facilitate the test
execution, which will be introduced in Section III-C.





















1) Execution Domain:

Definition 1. Let $E$ be a set of program elements of the same
type of AST nodes. The execution domain $D$ of a program
element $e \in E$ is a set of code that characterizes one execution
of $e$.

For instance, for an if element, the execution domain can
be defined as

$$D_{if} = \{\textit{then-branch}, \textit{else-branch}\}$$

where then-branch and else-branch are the execution
of the then branch and the else branch, respectively.

The execution domain is a generic concept. Besides
if, two examples of potential execution domains are as
follows: the execution domain of a method invocation
$func(var_a, var_b, \ldots)$ is $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ is a vector
of actual values of arguments in a method invocation; the execution
domain of switch-case is $\{case_1, case_2, \ldots, case_n\}$,
where $case_i$ is a case in the switch.

For try elements, we define the execution domain as follows

$$D_{try} = \{\textit{no-exception}, \textit{exception-caught}, \textit{exception-not-caught}\}$$

where no-exception, exception-caught, and
exception-not-caught are the execution results of a
try element: no exception is thrown, one exception is caught
by the catch block, and one exception is thrown in the catch
block but not caught, respectively. The execution domain
of try will be used in dynamic verification of exception
handling in Section V-B.

2) Execution Purity and Impurity: Values in $D$ are mutually
exclusive: a single execution of a program element is uniquely
classified in $D$. During the execution of a test case, a program
element $e \in E$ may be executed multiple times.

We refer to an execution result of a program element as
an execution signature. A pure execution signature denotes the
execution of a program element that yields a single value
in an execution domain $D$, e.g., only the then branch of an if
is executed by one given test case $t$. An impure execution
signature denotes the execution of a program element with
multiple values in $D$, e.g., both then and else branches are
executed by $t$. Given an execution domain, let $D^0 = \{\textit{impure}\}$
be the set of impure execution signatures.

The execution signature of a test case $t$ with respect to an
element $e$ is the aggregation of each value as follows. Let $T$
be the test suite, i.e., the set of all test cases; we define

$$f : E \times T \to D \cup D^0 \cup \{\bot\}$$

where $\bot$ (usually called "bottom") denotes that the test case $t$
does not execute the program element $e$. For example, $f(e, t) \in D^0$
indicates that both then and else branches of an if element
$e$ are executed by a test case $t$. If the test case always executes
the same element in the same way (e.g., the test case always
executes then in an if element), we call it pure. Note that
in the simplest case, a set of program elements may consist
of only one program element.
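
As an illustration of how $f(e, t)$ aggregates individual executions for an if element, the following Java sketch maps the list of branches observed during one test case to a value in $D \cup D^0 \cup \{\bot\}$. The names ExecutionSignatures, Branch, Signature, and aggregate are hypothetical and chosen for this example only; the sketch assumes that the observed branches have already been collected by some monitoring step.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ExecutionSignatures {
    /** One execution of an if element: a value of the execution domain D_if. */
    enum Branch { THEN, ELSE }

    /** Range of f: the two values of D_if, the impure marker (D^0), or bottom. */
    enum Signature { THEN_BRANCH, ELSE_BRANCH, IMPURE, BOTTOM }

    /** Aggregates all executions of one if element observed during one test case. */
    static Signature aggregate(List<Branch> observed) {
        if (observed.isEmpty()) {
            return Signature.BOTTOM;   // the test case never executes the element
        }
        Set<Branch> distinct = EnumSet.copyOf(observed);
        if (distinct.size() > 1) {
            return Signature.IMPURE;   // both branches were executed
        }
        return distinct.contains(Branch.THEN)
                ? Signature.THEN_BRANCH
                : Signature.ELSE_BRANCH;
    }
}
```

A test case that executes the then branch three times is still pure under this aggregation, because only the set of distinct values matters.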


In this paper, we consider a test case $t$ as a sequence of
test constituents, i.e., $t = (c_1, c_2, \ldots, c_n)$. Let $C$ denote the
set of $c_i$ ($1 \le i \le n$). Then the above function $f(e, t)$ can
be refined for the execution of a test constituent $c \in C$. A
function $g$ gives the purity of a program element according to
a test constituent:

$$g : E \times C \to D \cup D^0 \cup \{\bot\}$$

A test constituent $c$ is pure on $E$ if and only if
$(\forall e \in E)\ g(e, c) \in D \cup \{\bot\}$; $c$ is impure on $E$ if and only if
$(\exists e \in E)\ g(e, c) \in D^0$.

Definition 2. Given a set $E$ of program elements and a test
case $t \in T$, let us define the impurity indicator function
$\delta : \mathcal{E} \times T \to \{0, 1\}$, where $\mathcal{E}$ is the set of all the candidate sets of program
elements. In detail, $\delta(E, t) = 0$ if and only if the test case $t$
is pure (on the set $E$ of program elements) while $\delta(E, t) = 1$
if and only if $t$ is impure. Formally,

$$\delta(E, t) = \begin{cases} 0 \ (\text{pure}), & \text{iff } (\forall e \in E)\ f(e, t) \in D \cup \{\bot\} \\ 1 \ (\text{impure}), & \text{iff } (\exists e \in E)\ f(e, t) \in D^0 \end{cases}$$

At the test constituent level, the above definition of purity
and impurity of a test case can be stated as follows.
A test case $t$ is pure if
$(\exists x \in D)\ (\forall e \in E)\ (\forall c \in C)\ g(e, c) \in \{x\} \cup \{\bot\}$.
A test case $t$ is impure if $t$
contains either at least one impure constituent or at least two
different execution signatures on constituents. That is, either
$(\exists e \in E)\ (\exists c \in C)\ g(e, c) \in D^0$ or
$(\exists e \in E)\ (\exists c_1, c_2 \in C)\ (g(e, c_1) \ne g(e, c_2)) \wedge (g(e, c_1), g(e, c_2) \in D)$ holds.

An absolutely impure test case according to a set $E$ of
program elements is a test case for which there exists at least
one impure test constituent: $(\exists e \in E)\ (\exists c \in C)\ g(e, c) \in D^0$.
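
The constituent-level conditions can be sketched in Java for the simplest case where $E$ contains a single if element, so that a test case is described by the list of signatures $g(e, c_1), \ldots, g(e, c_n)$. The class TestCasePurity, the enum Sig, and both method names are hypothetical; restricting $E$ to one element is a simplifying assumption of this sketch, not a restriction of the definitions above.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class TestCasePurity {
    /** Possible values of g(e, c) for an if element e. */
    enum Sig { THEN_BRANCH, ELSE_BRANCH, IMPURE, BOTTOM }

    /** Impurity indicator for E = {e}: 0 if the test case is pure, 1 if it is impure. */
    static int impurityIndicator(List<Sig> constituentSignatures) {
        Set<Sig> pureValues = EnumSet.noneOf(Sig.class);
        for (Sig s : constituentSignatures) {
            if (s == Sig.IMPURE) {
                return 1;              // at least one impure constituent
            }
            if (s != Sig.BOTTOM) {
                pureValues.add(s);     // a value of D observed on some constituent
            }
        }
        // Two different values of D on different constituents also make the test impure.
        return pureValues.size() > 1 ? 1 : 0;
    }

    /** Absolutely impure: at least one constituent is itself impure. */
    static boolean isAbsolutelyImpure(List<Sig> constituentSignatures) {
        return constituentSignatures.contains(Sig.IMPURE);
    }
}
```

The distinction matters for splitting: a test case that is impure but not absolutely impure can, in principle, be cut into pure fragments, whereas an absolutely impure one always keeps at least one impure fragment.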

Definition 3. A program element $e$ is said to be purely covered
according to a test suite $T$ if all test cases yield pure execution
signatures: $(\forall t \in T)\ f(e, t) \notin D^0$. A program element $e$ is
impurely covered according to $T$ if any test case yields an
impure execution signature: $(\exists t \in T)\ f(e, t) \in D^0$. This
concept will be used to indicate the purity of test cases in
Section IV.

Note that the above definitions are independent of the
number of assertions per test case. Even if there is a single
assertion, the code before the assertion may explore the full
execution domain of certain program elements.

3) Test Case Purification: Test case refactoring aims to
rearrange test cases according to a certain task.
Test case purification is a type of test case refactoring that aims
to minimize the number of impure test cases. In this paper, our
definition of purity involves a set of program elements; hence,
there are multiple kinds of feasible purification, depending on
the considered program elements. For instance, developers can
purify a test suite with respect to a set of ifs or with respect
to a set of trys, etc.

Based on Definition 2, the task of test case purification
for a set $E$ of program elements is to find a test suite $T$ that
minimizes the amount of impurity as follows:

$$\arg\min_{T} \sum_{t \in T} \delta(E, t) \qquad (1)$$





TABLE I: Example of three test fragments and the execution
signature of an if element.

Test constituent    | c1 | c2          | c3 | c4          | c5 | c6          | c7
Execution signature | ⊥  | then-branch | ⊥  | else-branch | ⊥  | else-branch | then-branch
Test fragment       | (c1, c2, c3)          | (c4, c5, c6)                  | (c7)


The minimum of Equation (1) is 0, which is reached when all test cases
in T are pure. As shown later, this is usually not possible
in practice. Note that, in this paper, we do not aim to find
the absolutely optimal purified test suite, but to find a test
suite that improves dynamic analysis techniques. An impure
test case can be split into a set of smaller test cases that are
possibly pure.

Definition 4. A test fragment is a continuous sequence of 
test constituents. Given a set of program elements and a test 
case, i.e., a continuous sequence of test constituents, a pure test 
fragment is a test fragment that includes only pure constituents. 


Ideally, an impure test case without any impure test constituent
can be split into a sequence of pure test fragments, e.g.,
a test case consisting of two test constituents, which cover the
then and else branches, respectively. Given a set $E$ of
program elements and an impure test case $t = (c_1, \ldots, c_n)$
where $(\forall e \in E)\ g(e, c_i) \in D \cup \{\bot\}$ ($1 \le i \le n$), we can
split the test case into a set of $m$ test fragments with test case
purification. Let $\varphi_j$ be the $j$th test fragment ($1 \le j \le m$)
in $t$. Let $c_j^k$ denote the $k$th test constituent in $\varphi_j$ and $|\varphi_j|$
denote the number of test constituents in $\varphi_j$. We define $\varphi_j$ as
a continuous sequence of test constituents as follows

$$\varphi_j = (c_j^1, c_j^2, \ldots, c_j^{|\varphi_j|})$$

where $(\exists x \in D)\ (\forall e \in E)\ (\forall c \in \varphi_j)\ g(e, c) \in \{x\} \cup \{\bot\}$.

Based on the above definitions, given a test case without
impure test constituents, the goal of test case purification is to
generate a minimized number of pure test fragments.

Example of test case purification. In the best case,
an impure test case can be refactored into a set of test
fragments as above. Table I presents an example of test
case purification for a test case with seven test constituents
$t = (c_1, c_2, c_3, c_4, c_5, c_6, c_7)$ that are executed on a set of
if elements consisting of only one if element. Three test
fragments are formed as $(c_1, c_2, c_3)$, $(c_4, c_5, c_6)$, and $(c_7)$. In
test case purification, an absolutely impure test case (a test
case with at least one impure test constituent) necessarily
results in at least one impure test fragment (one test fragment
containing the impure test constituent) and zero or more pure
test fragments.



Fig. 2: Conceptual framework of test case purification. This
framework takes a program with test cases and a specific set
of program elements (e.g., if elements) as input; the output is
new test cases based on test fragments. The sum of test cases
in (a), (b), and (c) equals the number of original test cases.


B. B-Refactoring — A Test Suite Refactoring Approach 

As mentioned in Section III-A3, test case purification can
be implemented in different ways according to given dynamic
analysis techniques. Our prior work [41] purifies failing test
cases based on assertions for fault localization (a dynamic
analysis technique of identifying the root cause of a bug).

In this paper, we present B-Refactoring, our approach to
automatically refactoring a whole test suite. B-Refactoring is a
generalized approach for dynamic analysis techniques and their
various program elements.


1) Framework: B-Refactoring refactors a test suite according
to a criterion defined with a set of specific program
elements (e.g., if elements) in order to purify its execution
(according to the execution signatures in Section III-A). In a
nutshell, B-Refactoring takes the original test suite and the
requested set of program elements as input and generates
purified test cases as output.


Note that the goal of test case purification is not to 
replace the original test suite, but to enhance dynamic analysis 
techniques. Test case purification is done on-demand, just 
before executing a specific dynamic analysis. Consequently, it 
has no impact on future maintenance of test cases. In particular, 
new test cases potentially created by purification are not meant 
to be read or modified by developers. 


Fig. 2 illustrates the overall structure of our approach. We
first monitor the execution of test cases on the requested set E
of program elements. We record all the test cases that execute
E and collect the execution signatures of test constituents in
the recorded test cases. Second, we filter out pure test cases that
already exist in the test suite. Third, we divide the remaining
test cases into two categories: test cases with or without impure
test constituents. For each category, we split the test cases into
a set of test fragments. As a result, a new test suite is created,
whose execution according to a set of program elements is purer
than the execution of the original test suite.

In this paper, we consider a test constituent as a top-level 
statement in a test case. Examples of test constituents could 
be an assignment, a complete loop, a try block, a method 
invocation, etc. Our B-Refactoring does not try to split the 
statements that are inside a loop or a try branch in a test 
case. 

2) Core Algorithm: Algorithm 1 describes how B-Refactoring
splits a test case into a sequence of test fragments.
As mentioned in Section III-B1, the input is a test case and a
set of program elements to be purified while the output is a
set of test fragments.

Algorithm 1 returns a minimized set of pure test fragments
and a set of impure test fragments. In the algorithm, each
impure test constituent is kept and directly transformed into
an atomically impure test case that consists of only one
constituent. The remaining continuous test constituents are
clustered into several pure test fragments. Algorithm 1 consists
of two major steps. First, we traverse all the test constituents to
collect the last test constituent of each test fragment. Second,
based on this collection, we split the test case into pure or
impure test fragments. These test fragments can be directly
treated as test cases for a dynamic analysis application.

Taking the test case in Table I as an example, we briefly
describe the process of Algorithm 1. The traversal at Line 3
consists of only one program element according to Table I.
If only one of the then and else branches is executed,
we record this branch for the following traversal of the test
case (at Line 12). If a test constituent with a new execution
signature appears, its previous test constituent is collected as
the last constituent of a test fragment and the next test fragment
is initialized (at Line 14). That is, $c_3$ and $c_6$ in Table I are
collected as the last constituents. The end constituent of the
test case is collected as the last constituent of the last test
fragment (at Line 20), i.e., $c_7$. Lines 7 to 9 are not run
because there is no impure test constituent in Table I. After
the traversal of all the test constituents, Lines 21 to 26
are executed to obtain the final three test fragments based on
the collection of $c_3$, $c_6$, and $c_7$.
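
The two steps of Algorithm 1 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation of B-Refactoring: the class FragmentSplitter, the enum Sig, and the method split are hypothetical names, the signatures $g(e, c_i)$ are assumed to be precomputed, and constituents are identified by 0-based indices (so $c_1$ is index 0).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class FragmentSplitter {
    /** Possible values of g(e, c) for an if element e. */
    enum Sig { THEN_BRANCH, ELSE_BRANCH, IMPURE, BOTTOM }

    /**
     * signatures.get(k).get(i) holds g(e_k, c_{i+1}) for a test case with n constituents.
     * Returns the fragments as lists of 0-based constituent indices.
     */
    static List<List<Integer>> split(List<List<Sig>> signatures, int n) {
        // Step 1: collect the index of the last constituent of each fragment.
        SortedSet<Integer> lastIndices = new TreeSet<>();
        for (List<Sig> perElement : signatures) {
            Sig v = Sig.BOTTOM;                      // current pure signature, if any
            for (int i = 0; i < n; i++) {
                Sig g = perElement.get(i);
                if (g == Sig.IMPURE) {               // impure constituent
                    v = Sig.BOTTOM;
                    if (i > 0) {
                        lastIndices.add(i - 1);      // end of the previous fragment
                    }
                    lastIndices.add(i);              // impure fragment of one constituent
                } else if (g != Sig.BOTTOM) {        // pure constituent
                    if (v == Sig.BOTTOM) {
                        v = g;
                    } else if (v != g) {             // a new signature starts a new fragment
                        lastIndices.add(i - 1);
                        v = g;
                    }
                }
            }
        }
        if (n > 0) {
            lastIndices.add(n - 1);                  // last constituent of the last fragment
        }

        // Step 2: cut the test case after each collected constituent.
        List<List<Integer>> fragments = new ArrayList<>();
        int start = 0;
        for (int last : lastIndices) {
            List<Integer> fragment = new ArrayList<>();
            for (int i = start; i <= last; i++) {
                fragment.add(i);
            }
            fragments.add(fragment);
            start = last + 1;
        }
        return fragments;
    }
}
```

Applied to the seven signatures of Table I, the sketch collects indices 2, 5, and 6 (that is, $c_3$, $c_6$, and $c_7$) and returns the three fragments of the table.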


3) Validation of the Refactored Test Suite: Our algorithm for
refactoring a test suite is meant to be semantics preserving. In
other words, the refactored test suite should specify exactly the
same behavior as the original one. We use mutation testing
to validate that the refactored test suite is equivalent to the
original one. The idea of such validation is that all
mutants killed by the original test suite must also be killed
by the refactored one. Since, in practice, it is impossible to
enumerate all mutants, this validation is an approximation
of the equivalence before and after B-Refactoring. We will
present the validation results in Section IV-D.
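
The validation criterion amounts to a set inclusion, which the following Java sketch states explicitly. The class MutationValidation, the method preservesKills, and the use of string identifiers for mutants are assumptions made for this illustration; how the sets of killed mutants are obtained from a mutation testing tool is outside the sketch.

```java
import java.util.Set;

final class MutationValidation {
    /**
     * Returns true if every mutant killed by the original test suite
     * is also killed by the refactored test suite.
     */
    static boolean preservesKills(Set<String> killedByOriginal,
                                  Set<String> killedByRefactored) {
        return killedByRefactored.containsAll(killedByOriginal);
    }
}
```

The check is deliberately one-directional: the refactored suite may kill additional mutants, but it must not lose any that the original suite killed.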


C. Implementation 

We now discuss important technical aspects of B- 
Refactoring. Our tool, B-Refactoring, is implemented in Java 
1.7 and JUnit 4.11. For test cases written in JUnit 3, we use 


Input:
    E, a set of program elements;
    t = ⟨c_1, ..., c_n⟩, a test case with n test constituents;
    D, an execution domain of the program elements in E.
Output:
    Φ, a set of test fragments.

 1  Let C be an empty set of last constituents in fragments;
 2  Let v = ⊥ be a default execution signature;
 3  foreach program element e ∈ E do
 4      v = ⊥;
 5      foreach test constituent c_i in t (1 ≤ i ≤ n) do
 6          if g(e, c_i) is impure then        // Impure constituent
 7              v = ⊥;
 8              C = C ∪ c_{i-1};               // End of the previous fragment
 9              C = C ∪ c_i;                   // Impure fragment of one constituent
10          else if g(e, c_i) ∈ D then         // Pure constituent
11              if v = ⊥ then
12                  v = g(e, c_i);
13              else if v ≠ g(e, c_i) then     // v ∈ D
14                  C = C ∪ c_{i-1};
15                  v = g(e, c_i);
16              end
17          end
18      end
19  end
20  C = C ∪ c_n;                               // Last constituent of the last fragment
21  Let c+ = c_1;
22  foreach test constituent c_j in C do
23      φ = ⟨c+, ..., c_j⟩;                    // Creation of a test fragment
24      Φ = Φ ∪ φ;
25      c+ = c_{j+1};
26  end

Algorithm 1: Splitting a test case into a set of test fragments
according to a given set of program elements.
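
The splitting procedure of Algorithm 1 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the B-Refactoring implementation: the class name FragmentSplitter, the method split, and the encoding of execution signatures as sets of strings (empty set = element not executed, one value = pure, several values = impure) are all hypothetical. Constituents are indexed from 0 instead of 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch of Algorithm 1; all names here are hypothetical. */
public class FragmentSplitter {

    /**
     * signatures.get(e).get(i) holds the execution-domain values that
     * constituent i covers on program element e (assumed encoding).
     * Returns the test fragments as lists of constituent indices.
     */
    public static List<List<Integer>> split(List<List<Set<String>>> signatures, int n) {
        List<List<Integer>> fragments = new ArrayList<>();
        if (n == 0) {
            return fragments; // nothing to split
        }
        TreeSet<Integer> lastConstituents = new TreeSet<>(); // the set C
        for (List<Set<String>> perElement : signatures) {
            String v = null; // null plays the role of the default signature
            for (int i = 0; i < n; i++) {
                Set<String> g = perElement.get(i);
                if (g.size() > 1) { // impure constituent
                    v = null;
                    if (i > 0) {
                        lastConstituents.add(i - 1); // end of the previous fragment
                    }
                    lastConstituents.add(i); // impure fragment of one constituent
                } else if (g.size() == 1) { // pure constituent
                    String current = g.iterator().next();
                    if (v != null && !v.equals(current)) {
                        lastConstituents.add(i - 1); // signature changes: close fragment
                    }
                    v = current;
                }
            }
        }
        lastConstituents.add(n - 1); // last constituent of the last fragment
        int start = 0;
        for (int last : lastConstituents) { // ascending order
            List<Integer> fragment = new ArrayList<>();
            for (int k = start; k <= last; k++) {
                fragment.add(k);
            }
            fragments.add(fragment);
            start = last + 1;
        }
        return fragments;
    }
}
```

For a single if element and three constituents covering then, then, and else, the sketch closes the first fragment before the third constituent and returns two fragments.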


a converter to adapt them to JUnit 4. Our tool is developed
on top of Spoon, a Java library for source code transformation
and analysis. B-Refactoring handles a number of interesting
cases and uses its own test driver to take them into account.


1) Execution Order: To ensure the execution order of test
fragments, the B-Refactoring test driver uses a specific
annotation @TestFragment(origin, order) to execute test
fragments in the correct order. Test methods are automatically
tagged with the annotation during purification. Examples of
this annotation are shown in Fig. 1c. Also, when test
fragments use variables that were local to the test method before
refactoring, these variables are changed into fields of the test
class. In case of name conflicts, they are automatically renamed
in a unique way.
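
As a hedged illustration of how such an annotation could drive the execution order, the sketch below declares a @TestFragment annotation and sorts fragment methods by (origin, order) through reflection. The attribute types (String origin, int order), the class FragmentOrderDemo, the nested SplitTest class, and the method executionOrder are assumptions made for this example; the actual B-Refactoring test driver may differ.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FragmentOrderDemo {

    /** Hypothetical shape of the annotation; attribute types are assumptions. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface TestFragment {
        String origin(); // name of the original test case
        int order();     // position of the fragment within that test case
    }

    /** Two fragments of one original test case, declared out of order on purpose. */
    public static class SplitTest {
        @TestFragment(origin = "testChopNewLine", order = 2)
        public void testChopNewLine_2() { }

        @TestFragment(origin = "testChopNewLine", order = 1)
        public void testChopNewLine_1() { }
    }

    /** Returns the names of the fragment methods sorted by (origin, order). */
    public static List<String> executionOrder(Class<?> testClass) {
        List<Method> fragments = new ArrayList<>();
        for (Method m : testClass.getDeclaredMethods()) {
            if (m.isAnnotationPresent(TestFragment.class)) {
                fragments.add(m);
            }
        }
        fragments.sort(Comparator
            .comparing((Method m) -> m.getAnnotation(TestFragment.class).origin())
            .thenComparingInt(m -> m.getAnnotation(TestFragment.class).order()));
        List<String> names = new ArrayList<>();
        for (Method m : fragments) {
            names.add(m.getName());
        }
        return names;
    }
}
```

Sorting explicitly matters because reflection does not guarantee that methods are returned in declaration order.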


2) Handling setUp and tearDown: Unit testing can make
use of common setup and finalization code. JUnit 4 uses Java
annotations to facilitate writing this code. For each test case,
a setUp method (with the annotation @Before in JUnit 4)
and a tearDown method (with @After) are executed before
and after the test case, e.g., initializing a local variable before


Spoon 2.0, http://spoon.gforge.inria.fr/















TABLE II: Projects in empirical evaluation.

Project          Description                                              Source LoC  #Test cases
Lang             A Java library for manipulating core classes                 65,628        2,254
Spojo-core       A rule-based transformation tool for Java beans               2,304          133
Jbehave-core     A framework for behavior-driven development                  18,168          457
Shindig-gadgets  A container to allow sites to start hosting social apps      59,043        2,002
Codec            A Java library for encoding and decoding                     13,948          619
Total                                                                        159,091        5,465

the execution of the test case and resetting a variable after the
execution, respectively. In B-Refactoring, to ensure the same
execution of a given test case before and after refactoring, we
include setUp and tearDown methods in the first and the
last test fragments. This is illustrated in Fig. 1c.


We will show that applying refactoring can improve the purity
for individual program elements in Section IV-C.

1) Protocol: We focus on the following metrics to present
the purity level of test cases:


3) Shared Variables in a Test Case: Some variables in a test 
case may be shared by multiple statements, e.g., one common 
variable in two assertions. In B-Refactoring, to split a test case 
into multiple ones, a shared variable in a test case is renamed 
and extracted as a class field. Then each new test case can 
access this variable; meanwhile, the behavior of the original 
test case is not changed. Experiments in Section IV-D will also 
confirm the unchanged behavior of test cases. 


IV. Empirical Study on Test Suite Refactoring

In this section, we evaluate our technique for refactoring 
test suites. This work addresses a novel problem statement: 
refactoring a test suite to enhance dynamic analysis. To our 
knowledge, there is no similar technique that can be used 
to compare against. However, a number of essential research 
questions have to be answered. 


A. Projects 

We evaluate our test suite refactoring technique on five
open-source Java projects: Apache Commons Lang (Lang for
short), Spojo-core, Jbehave-core, Apache Shindig Gadgets
(Shindig-gadgets for short), and Apache Commons Codec
(Codec for short). These projects are all under the umbrella
of respected code organizations (three out of five projects by
Apache).


B. Empirical Observation on Test Case Purity 


RQ1: What is the purity of test cases in our dataset?


We empirically study the purity of test cases for two types
of program elements, i.e., if elements and try elements. The
goal of this empirical study is to measure the existing purity
of test cases for if and try before refactoring. The analysis
for if will facilitate the study on software repair in Section
V-A, while the analysis for try will facilitate the study on
dynamic verification of exception contracts in Section V-B.


Apache Commons Lang 3.2, http://commons.apache.org/lang/
Spojo-core 1.0.6, http://github.com/sWoRm/Spojo
Jbehave-core, http://jbehave.org/
Apache Shindig Gadgets, http://shindig.apache.org/
Apache Commons Codec 1.9, http://commons.apache.org/codec/
Apache Software Foundation, http://apache.org/


• #Pure is the number of pure test cases on all program 
elements under consideration; 

• #Non-absolutely impure is the number of impure test
cases that contain no impure test constituent;

• #Absolutely impure is the number of test cases that 
consist of at least one impure test constituent. 

The numbers of test cases in these three metrics are mapped 
to the three categories (a), (b), and (c) of test cases in Fig 
respectively. 
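
The three categories can be made concrete with a small Java sketch. It is only an illustration under assumptions: execution signatures are encoded as sets of strings (one value = pure, several values = impure, empty = element not executed), a test case is taken to be impure on an element when its constituents together cover more than one value of the execution domain, and the names PurityClassifier, Purity, and classify are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative classification of one test case; names are hypothetical. */
public class PurityClassifier {

    public enum Purity { PURE, NON_ABSOLUTELY_IMPURE, ABSOLUTELY_IMPURE }

    /**
     * signatures.get(e).get(i) holds the execution-domain values that
     * constituent i of the test case covers on program element e.
     */
    public static Purity classify(List<List<Set<String>>> signatures) {
        boolean impureTestCase = false;
        for (List<Set<String>> perElement : signatures) {
            Set<String> coveredByTestCase = new HashSet<>();
            for (Set<String> g : perElement) {
                if (g.size() > 1) {
                    // at least one impure constituent: absolutely impure
                    return Purity.ABSOLUTELY_IMPURE;
                }
                coveredByTestCase.addAll(g);
            }
            if (coveredByTestCase.size() > 1) {
                // every constituent is pure, but the test case as a whole is not
                impureTestCase = true;
            }
        }
        return impureTestCase ? Purity.NON_ABSOLUTELY_IMPURE : Purity.PURE;
    }
}
```

Under this encoding, a non-absolutely impure test case is exactly one that splitting can turn into pure fragments, whereas an absolutely impure one keeps at least one impure fragment.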

For test constituents, we use the following two metrics,
i.e., #Total constituents and #Impure constituents. For program
elements, we use the metric #Purely covered program elements, as
defined in Section III-A.

We leverage B-Refactoring to calculate evaluation metrics 
and to give an overview of the purity of test suites for the five 
projects. 


2) Results: We analyze the purity of test cases in our dataset
with the metrics proposed in Section IV-B1. Table III shows the
purity of test cases for if elements. In the project Lang, 539
out of 2,254 (23.91%) test cases are pure for all the executed
if elements, while 371 (16.46%) and 1,344 (59.63%) test cases
are impure without and with impure test constituents, respectively.
In total, 1,685 out of 5,465 (30.83%) test cases are pure for all
the executed if elements. These results show that there is room
for improving the purity of test cases and achieving a higher
percentage of pure test cases.


As shown in the column Test constituent in Table III,
33.81% of test constituents are impure. After applying test
suite refactoring, all those impure constituents will be isolated
in their own test fragments. That is, the number of absolutely
impure constituents equals the number of impure test cases after
refactoring.


In Table III, we also present the purity of test cases
according to the number of if elements. In the project Lang,
2,263 out of 2,397 if elements are executed by the whole
test suite. Among these executed if elements, 451 (19.93%)
are purely covered. In total, among the five projects, 44.07%
of the executed if elements are purely covered. Hence, it is
worthwhile to increase the number of purely covered if elements
with our test case purification technique.

























TABLE III: Purity of test cases for if elements according to the number of test cases, test constituents, and if elements.

                 Test case                                               Test constituent         if element
                 #Total  Pure            Non-absolutely  Absolutely      #Total  #Impure          #Total  #Executed  Purely covered
                                         impure          impure
Project                  #      %        #      %        #      %                #       %                           #      %
Lang             2,254   539    23.91%   371    16.46%   1,344  59.63%   19,682  5,705   28.99%   2,397   2,263      451    19.93%
Spojo-core       133     38     28.57%   5      3.76%    90     67.67%   999     168     16.82%   87      79         45     56.96%
Jbehave-core     457     195    42.67%   35     7.76%    227    49.67%   3,631   366     10.08%   428     381        230    60.37%
Shindig-gadgets  2,002   731    36.51%   133    6.64%    1,138  56.84%   14,063  6,610   47.00%   2,378   1,885      1,378  73.10%
Codec            619     182    29.40%   123    19.87%   314    50.73%   3,458   1,294   37.42%   507     502        148    29.48%
Total            5,465   1,685  30.83%   667    12.20%   3,113  56.96%   41,833  14,143  33.81%   5,797   5,110      2,252  44.07%

TABLE IV: Purity of test cases for try elements according to the number of test cases, test constituents, and try elements.

                 Test case                                               Test constituent         try element
                 #Total  Pure            Non-absolutely  Absolutely      #Total  #Impure          #Total  #Executed  Purely covered
                                         impure          impure
Project                  #      %        #      %        #      %                #       %                           #      %
Lang             2,254   295    13.09%   1,873  83.10%   86     3.81%    19,682  276     1.40%    73      70         35     50.00%
Spojo-core       133     52     39.10%   81     60.90%   0      0.00%    999     0       0.00%    6       5          5      100.00%
Jbehave-core     457     341    74.62%   91     19.91%   25     5.47%    3,631   29      0.80%    67      57         43     75.44%
Shindig-gadgets  2,002   1,238  61.84%   702    35.06%   62     3.10%    14,063  73      0.52%    296     244        221    90.57%
Codec            619     88     14.22%   529    85.46%   2      0.32%    3,458   2       0.06%    18      16         14     87.50%
Total            5,465   2,014  36.85%   3,276  59.95%   175    3.20%    41,833  380     0.91%    460     392        318    81.12%


For try elements, we use the execution domain defined
in Section III-A1 and compute the same metrics. Table IV
shows the purity of test cases for try elements. In Lang,
295 out of 2,254 (13.09%) test cases are always pure for
all the executed try elements. In total, the percentages of
always pure and absolutely impure test cases are 36.85% and
3.20%, respectively. In contrast to if elements in Table III,
the number of absolutely impure test cases in Spojo-core is
zero. The major reason is that there is a much larger number
of test cases in Lang (2,254), compared to Spojo-core (133). In
the five projects, based on the purity of test cases according to
the number of try elements, 81.12% of the executed try elements
are purely covered.


Comparing the purity of test cases between if and try,
the percentages of pure test cases for if elements and try
elements are similar, 30.83% and 36.85%, respectively. In
addition, the percentage of purely covered try elements is
81.12%, which is higher than that of purely covered if elements,
i.e., 44.07%. That is, 81.12% of try elements are executed by
test cases with pure execution signatures but only 44.07% of
if elements are executed by test cases with pure execution
signatures. This comparison indicates that, for the same project,
different execution domains of input program elements lead to
different purity of test cases. We can further
improve the purity of test cases according to the execution
domain (implying a criterion for purification) for a specific
dynamic analysis technique.


Answer to RQ1: Only 31% (resp. 37%) of test cases are
pure with respect to if elements (resp. try elements).


C. Empirical Measurement of Refactoring Quality 


RQ2: Are test cases purer on individual program elements
after applying our test suite refactoring technique?


We evaluate whether our test case purification technique
can improve the execution purity of test cases. Purified test
cases cover smaller parts of the control flow; consequently,
they will provide better support to dynamic analysis tasks.

1) Protocol: To empirically assess the quality of our
refactoring technique with respect to purity, we employ the
following metrics (see the definition in Section III-A):

• #Purely covered program elements is the number of 
program elements, each of which is covered by all test 
cases with pure execution signatures; 

• #Program elements with at-least-one pure test case is the 
number of program elements, each of which is covered 
by at least one test case with a pure execution signature. 

For dynamic analysis, we generally aim to obtain a higher
value of those two metrics after test suite refactoring. For
each metric, we list the number of program elements before
and after applying B-Refactoring as well as the improvement:
absolute (#After − #Before) and relative
((#After − #Before) / #Before).
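
As a worked example of these two improvement measures, the following Java sketch recomputes the values reported for purely covered if elements in Lang (451 before, 1,701 after). The class and method names are hypothetical and serve only this illustration.

```java
/** Worked example of the absolute and relative improvement; names are hypothetical. */
public class Improvement {

    /** Absolute improvement: #After - #Before. */
    public static int absolute(int before, int after) {
        return after - before;
    }

    /** Relative improvement in percent: (#After - #Before) / #Before. */
    public static double relativePercent(int before, int after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // Purely covered if elements in Lang: 451 before, 1,701 after.
        System.out.println(absolute(451, 1701));
        System.out.printf("%.2f%%%n", relativePercent(451, 1701));
    }
}
```

The relative measure is undefined when #Before is zero, which is why a project without any element in the initial category has no percentage to report.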

2) Results: The first part of Table V shows the improvement
of test case purity for if elements before and after applying
B-Refactoring. For the project Lang, 2,263 if elements are
executed by the whole test suite. After applying B-Refactoring
to the test suite, 1,250 (from 451 to 1,701) if elements
are changed to be purely covered. The relative improvement
reaches 277.16% (1,250/451). Moreover, 884 (from 1,315 to
2,199) if elements are changed to be covered by at least
one pure test case.

For all five projects, 1,494 more if elements become purely
covered and 939 more if elements become covered by at least one
pure test case after applying B-Refactoring. These results
indicate that the purity of test cases for if elements is highly
improved via test case purification. Note that the improvement
on Lang is higher than that on the other four projects. A possible
reason is that Lang has a complex implementation and the














































TABLE V: Test case purity by measuring the number of purely covered ifs and trys. The number of purely covered program
elements increases after applying test case purification with B-Refactoring.

                               Purely covered if                 if with at-least-one pure test case
Project          #Executed if  #Before  #After  Improvement      #Before  #After  Improvement
                                                #      %                          #      %
Lang             2,263         451      1,701   1,250  277.16%   1,315    2,199   884    67.22%
Spojo-core       79            45       54      9      20.00%    75       78      3      4.00%
Jbehave-core     381           230      262     32     13.91%    347      355     8      2.31%
Shindig-gadgets  1,885         1,378    1,521   143    10.38%    1,842    1,856   14     0.76%
Codec            502           148      208     60     40.54%    411      441     30     7.30%
Total for ifs    5,110         2,252    3,746   1,494  66.34%    3,990    4,929   939    23.53%

                               Purely covered try                try with at-least-one pure test case
Project          #Executed try #Before  #After  Improvement      #Before  #After  Improvement
                                                #      %                          #      %
Lang             70            35       58      23     65.71%    61       68      7      11.48%
Spojo-core       5             5        5       0      0.00%     5        5       0      0.00%
Jbehave-core     57            43       44      1      2.33%     54       54      0      0.00%
Shindig-gadgets  244           221      229     8      3.62%     241      242     1      0.41%
Codec            16            14       16      2      14.29%    16       16      0      0.00%
Total for trys   392           318      352     34     10.69%    377      385     8      2.12%

original design of the test suite targets only software testing
and maintenance, not the usage in a dynamic analysis
technique.

Similarly, the second part of Table V shows the
improvement for try elements before and after applying
B-Refactoring. In Lang, 23 (from 35 to 58) try elements are
changed to be purely covered after applying B-Refactoring;
7 (from 61 to 68) try elements are changed to be covered by at
least one pure test case. For all five projects, 34 (from
318 to 352) try elements change to be purely covered after
test case purification, while 8 (from 377 to 385) try elements
are improved in the sense that they become covered by
at least one pure test case. Note that for Spojo-core, no value
is changed before and after test case purification due to the
small number of test cases.


Answer to RQ2: After test suite refactoring, if and try
elements are more purely executed. The numbers of purely covered
if and try elements are improved by 66% and 11%, respectively.


D. Mutation-based Validation for Refactored Test Suites 


RQ3: Do the automatically refactored test suites have the
same fault revealing power as the original ones?

In this section, we employ mutation testing to validate that
a refactored test suite has the same behavior as the original
one.

1) Protocol: For each project, we generate mutants by
injecting bugs into the program code. A mutant is killed by
a test suite if at least one test case fails on this mutant. To
evaluate whether a refactored test suite behaves the same as the
original one, each mutant should satisfy one of the two
following rules: the mutant is killed by both the original test
suite and the refactored one; or the mutant is killed by neither
test suite. For three projects of our dataset, Lang, Jbehave-
core, and Codec, we randomly select 100 mutants per project.
For each mutant, we individually run the original test suite and
the purified test suite to check whether the mutant is killed.
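
The check described in this protocol can be sketched as a comparison of kill results per mutant. The Java below is a minimal illustration under assumptions: kill results are given as maps from a mutant identifier to a boolean, and the names MutationEquivalence and sameKillBehavior are hypothetical. Running the test suites on the mutants is outside the scope of the sketch.

```java
import java.util.Map;

/** Illustrative check of the two rules of the protocol; names are hypothetical. */
public class MutationEquivalence {

    /**
     * Returns true if every mutant is either killed by both test suites
     * or killed by neither of them.
     */
    public static boolean sameKillBehavior(Map<String, Boolean> killedByOriginal,
                                           Map<String, Boolean> killedByRefactored) {
        if (!killedByOriginal.keySet().equals(killedByRefactored.keySet())) {
            return false; // the two runs must cover the same mutants
        }
        for (Map.Entry<String, Boolean> entry : killedByOriginal.entrySet()) {
            Boolean refactored = killedByRefactored.get(entry.getKey());
            if (!entry.getValue().equals(refactored)) {
                return false; // one suite kills the mutant, the other does not
            }
        }
        return true;
    }
}
```

Because only a sample of mutants is checked, a true result supports but does not prove the equivalence of the two test suites.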

2) Results: Experimental results show that one of the two
rules in Section IV-D1 is satisfied for each mutant. In
detail, 81 mutants in Lang, 61 mutants in Jbehave-core, and
89 mutants in Codec are killed by both original and purified
test suites, while 18, 33, and 10 mutants are alive in both
original and purified test suites, respectively. Moreover, 1, 6,
and 1 mutants, respectively in the three projects, lead both the
original and refactored test suites to an infinite loop. To sum
up, mutation-based validation for refactored test suites shows
that our technique can provide the same behavior for the
refactored test suites as the original test suites.


Answer to RQ3: The test suites automatically refactored 
by B-Refactoring catch the same mutants as the original 
ones. 


V. Improving Dynamic Analysis Using Test Suite
Refactoring

We apply our test suite refactoring approach, B-Refactoring,
to improve two typical dynamic analysis techniques,
automatic repair and exception contract analysis.

A. Test Suite Refactoring for Automatic Repairing Three Bugs 


RQ4: Does B-Refactoring improve the automatic program
repair of Nopol?

When repairing if-condition bugs, Nopol suffers from test cases
with ambiguous executions, i.e., test cases that cover both the
then and else branches. In this section, we leverage test
case purification to eliminate the ambiguity of test case
execution. In other words, we refactor test cases to convert
original impure test cases into purified ones to assist automatic
repair.








































TABLE VI: Evaluation of the effect of purification for
automatic repair for if-condition bugs. Traces of test cases after
applying B-Refactoring (last column) enable a repair approach
to find patches as in the second column.

ID      Patch                                          #Test cases
                                                       Before  After
137371  lastIdx <= 0                                   1       2
137552  len < 0 || pos > str.length()                  1       4
904093  className == null || className.length() == 0   2       3


1) Protocol: We present a case study on three real-world
bugs in Apache Commons Lang. All three bugs are
located in if-conditions. However, Nopol cannot directly fix
these bugs because of the impurity of test cases. Thus, we use
B-Refactoring to obtain purified test cases. This enables Nopol
to repair those previously-unfixed bugs.


2) Case Study 1: Table VI recapitulates three bugs, their 
patches and the created test cases. In total, nine pure test cases 
are obtained after applying test case purification to the original 
four test cases. Note that only the executed test cases for the 
buggy ifs are listed, not the whole test suite. We show how 
test suite refactoring influences the repair of the bug with ID 
137371 as follows. 


Fig. 3a shows a code snippet with a buggy if condition
at Line 5 of the bug with ID 137371. In Fig. 3a, the method
chopNewline aims to remove the line break of a string.
The original if condition missed the case of lastIdx
< 0. In Fig. 3b, a test case testChopNewLine targets this
method. We show three test constituents, i.e., three assertions,
in this test case (other test constituents are omitted to save
space). The first two assertions cover the then branch
of the if condition at Line 5 of chopNewline while the
last assertion covers the else branch. Such a test case will
lead the repair approach to an ambiguous behavior; that is,
the repair approach cannot find the covered branch of this test
case. Hence, a repair approach cannot generate a patch for this
bug.

Based on B-Refactoring, our test case purification
technique can split the test case into two test cases, as shown at
Line 10 in Fig. 3b. We replace the original test case with two
new test cases after B-Refactoring. Then the repair approach
can generate a patch, which is the same as the manual patch
at Line 4 in Fig. 3a.
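
To make the split concrete, the following Java sketch shows the two test fragments next to a patched chopNewline. It is an illustration under assumptions rather than the code generated by B-Refactoring: the value of FOO, the fragment names, and the plain check helper (used instead of JUnit so that the example is self-contained) are hypothetical, and chopNewline is reconstructed from the figure with the patched condition lastIdx <= 0.

```java
/** Illustrative split of testChopNewLine into two fragments; names are hypothetical. */
public class ChopNewlineFragments {

    static final String FOO = "foo"; // assumed value of the test constant

    /** chopNewline with the patched condition (lastIdx <= 0). */
    public static String chopNewline(String str) {
        int lastIdx = str.length() - 1;
        if (lastIdx <= 0) {
            return "";
        }
        char last = str.charAt(lastIdx);
        if (last == '\n') {
            if (str.charAt(lastIdx - 1) == '\r') {
                lastIdx--;
            }
        } else {
            lastIdx++;
        }
        return str.substring(0, lastIdx);
    }

    /** Minimal stand-in for assertEquals. */
    static void check(String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
        }
    }

    /** First fragment: the two assertions before the split point. */
    public static void testChopNewLine_1() {
        check(FOO + "\n" + FOO, chopNewline(FOO + "\n" + FOO));
        check(FOO + "b\n", chopNewline(FOO + "b\n\n"));
    }

    /** Second fragment: the assertion after the split point. */
    public static void testChopNewLine_2() {
        check("", chopNewline("\n"));
    }
}
```

Each fragment executes only one branch of the if condition at Line 5, so a repair approach can associate every fragment with a single expected branch.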

Results on the two other bugs (with ID 137552 and ID
904093) can be found in the Appendix. B-Refactoring also
enables Nopol to find the fixes for these bugs. For ID 904093,
in addition to automatic refactoring, we also manually add
a test case that specifies a missing situation, which was an
omission in the original design of the test suite in Lang.

To sum up, we have shown that our test case purification
approach enables the automatic repair of three previously-
unfixed bugs, by providing a refactored version of the test


For more details, visit
https://fisheye6.atlassian.com/changelog/commons?cs=137371,
https://fisheye6.atlassian.com/changelog/commons?cs=137552,
and https://fisheye6.atlassian.com/changelog/commons?cs=904093


suite that produces traces that are optimized for the technique 
under consideration. 


Answer to RQ4: B-Refactoring improves the repairability
of the Nopol program repair technique on three real-
world bugs, which could not be fixed before applying
B-Refactoring.


B. Test Suite Refactoring for Exception Contract Analysis 


RQ5: Does B-Refactoring improve the efficiency of the
SCTA contract verification technique?

In this section, we employ an existing dynamic analysis
technique of exception contracts called Short-Circuit Testing
Algorithm (SCTA), by Cornu et al. [7]. SCTA aims to verify
an exception handling contract called source-independence,
which states that catch blocks should work in all cases when
they catch an exception. Assertions in a test suite are used
to verify the correctness of test cases. The process of SCTA
is as follows. To analyze exception contracts, exceptions are
injected at the beginning of try elements to trigger the catch
branches; meanwhile, a test suite is executed to record whether
a try or catch branch is covered by each test case. SCTA
requires that test cases execute only the try or the catch.

However, if both try and catch branches are executed 
by the same test case, SCTA cannot identify the coverage of 
the test case. In this case, the logical predicates behind the 
algorithm state that the contracts cannot be verified because the 
execution traces of test cases are not pure enough with respect 
to try elements. According to the terminology presented in 
this paper, we call such test cases covering both branches 
impure. If all the test cases that execute a try element 
are impure, no test cases can be used for identifying the 
source-independence. To increase the number of identified 
try elements and decrease the number of unknown ones, we 
leverage B-Refactoring to refactor the original test cases into 
purer test cases. 

1) Protocol: We apply test case purification on the five
projects in Section IV-A. The goal of this experiment is to
evaluate how many try elements are recovered from unknown
ones. We apply B-Refactoring to the test suite before analyzing
the exception contracts. That is, we first refactor the test suite
and then apply SCTA on the refactored version.

We analyze exception contracts with the following metrics:

• #Source-independent is the number of verified source- 
independent try elements; 

• #Source-dependent is the number of verified source- 
dependent try elements; 

• #Unknown is the number of unknown try elements, 
because all the test cases are impure. 

The last metric is the key one in this experiment. The goal is 
to decrease this metric by refactoring, i.e., to obtain less try- 

Note that the sum of the three metrics is constant before and after applying
test suite refactoring.

















 1  String chopNewline(String str) {
 2    int lastIdx = str.length() - 1;
 3
 4    // PATCH: if (lastIdx <= 0)
 5    if (lastIdx == 0)
 6      return "";
 7    char last = str.charAt(lastIdx);
 8    if (last == '\n') {
 9      if (str.charAt(lastIdx - 1) == '\r')
10        lastIdx--;
11    } else
12      lastIdx++;
13    return str.substring(0, lastIdx);
14  }

(a) Buggy program


 1  void testChopNewLine() {
 2    ...
 3    assertEquals(FOO + "\n" + FOO,
 4        StringUtils.chopNewline(FOO
 5        + "\n" + FOO));
 6    assertEquals(FOO + "b\n",
 7        StringUtils.chopNewline(FOO
 8        + "b\n\n"));
 9
10    // B-refactoring splits here
11
12    assertEquals("",
13        StringUtils.chopNewline("\n"));
14  }

(b) Test case


Fig. 3: Code snippets of a buggy program and a test case. The buggy if condition is at Line 5 of Fig. 3a; the test case in Fig. 3b
executes both the then and else branches of the buggy if. Then B-Refactoring splits the test case into two test cases (at
Line 10 in Fig. 3b).


TABLE VII: Test suite refactoring improves the analysis of exception contracts by decreasing the number of unknown trys.

                 Before                                            After                                             Improvement on #Unknown
Project          #Source-independent  #Source-dependent  #Unknown  #Source-independent  #Source-dependent  #Unknown  #    %
Lang             23                   5                  22        37                   6                  7         15   68.18%
Spojo-core       1                    0                  0         1                    0                  0         0    n/a
Jbehave-core     7                    2                  33        8                    2                  32        1    3.03%
Shindig-gadgets  30                   12                 38        31                   13                 36        2    5.26%
Codec            8                    0                  2         10                   0                  0         2    100.00%
Total            69                   19                 95        87                   21                 75        20   21.05%


catch blocks, whose execution traces are too impure to apply 
the verification algorithm. 


2) Results: We investigate the results of the exception
contract analysis before and after B-Refactoring.


Table VII presents the number of source-independent try
elements, the number of source-dependent try elements, and
the number of unknown ones. Taking the project Lang as an
example, the number of unknown try elements decreases
by 15 (from 22 to 7). This enables the analysis to prove
the source-independence for 14 more try elements (from 23 to 37)
and to prove source-dependence for one more (from 5 to 6).
That is, by applying test case purification to the test suite in
project Lang, we can detect whether 68.18% (15/22) of the
previously unknown try elements are source-independent or not.


For all the five projects, 21.05% (20 out of 95) of try 
elements are rescued from unknown ones. This result shows 
that B-Refactoring can refactor test suites to cover simple 
branches of try elements. Such refactoring helps the dynamic 
analysis to identify the source independence. 


Answer to RQ5: Applying B-Refactoring to test suites
improves the ability of SCTA to verify exception
contracts. The number of unknown exception contracts is
reduced by 21%.


VI. Threats to validity 

We discuss threats to the validity of our B-Refactoring 
results. 

Generality. We have shown that B-Refactoring improves
the efficiency of program repair and contract verification.
However, the two considered approaches stem from our own
research. This is natural, since our expertise in dynamic
analysis makes us aware of the nature of the problem. To further
assess the generic nature of our refactoring approach, future
experiments involving other dynamic analysis techniques are
required.

Internal validity. Test code can be complex. For example,
a test case can have loops and internal mocking classes. In
our implementation, we consider test constituents as top-level
statements; thus, complex test constituents are treated as
atomic ones. Hence, B-Refactoring does not process the
statements nested inside them.

Construct validity. Experiments in our paper mainly focus
on if and try program elements. Both of these program
elements can be viewed as a kind of branch statement.
Our proposed work can also be applied to other elements,
like method invocations (see Section III-A1). To show more
evidence of the improvement on dynamic analysis, more
experiments could be conducted for different program elements.

VII. Related Work

We list the related work to our paper in three categories:
the approach of test suite refactoring and two application
scenarios.


A. Test Case Refactoring 

Test code refactoring is a general concept of making
test code more understandable, readable, or maintainable.
Based on 11 test code smells, Deursen et al. [37] first propose
the concept of test case refactoring as well as 6 test code
refactoring rules, including reducing test case dependence and
adding explanation for assertions. Extending this work,
Van Deursen & Moonen and Pipka propose how to
refactor test code for the test-first rule in extreme programming.
Guerra & Fernandes define a set of representation rules
for different categories of test code refactoring. Moreover, Xu
et al. propose directed test suite augmentation methods
to detect code affected by code changes and to generate test
cases covering this code.


Refactoring techniques for source code have been
introduced to test code. Existing known patterns in refactoring
are applied to test cases to achieve better-designed test code.
Chu et al. propose a pattern-based approach to refactoring
test code to keep the correctness of test code and to remove
bad code smells. Alves et al. employ pattern-based
refactoring on test code to improve regression testing via
test case selection and prioritization. In contrast to modifying
one test via the above pattern-based refactoring on test code,
our work in this paper aims to split one test case into a set
of small and pure test cases. The new test cases can assist a
specific software task, e.g., splitting test cases to execute single
branches in if elements for software repair and to trigger a
specific status of try elements for exception handling.


In our prior work [41], we proposed a test-case purification
approach for fault localization. The differences with this paper
are major: first, we address different problems (repair and
verification versus fault localization); second, the purification
technique is completely different (generic splitting of both
passing and failing test cases versus assertion splitting in failing
test cases).


B. Automatic Software Repair 

Automatic software repair aims to generate patches to fix
software bugs. Software repair employs a given set of test cases
to validate the correctness of generated patches. Weimer et al.
propose GenProg, a genetic-programming based approach
to fixing C bugs. This approach views a fraction of source
code as an AST and updates ASTs by inserting and replacing
known AST nodes. Nguyen et al. propose SemFix,
a semantic-analysis based approach, also for C bugs. This
approach combines symbolic execution, constraint solving, and
program synthesis to narrow down the search space of repair
expressions.

Martinez & Monperrus mine historical repair actions
based on fine-grained ASTs with a probabilistic model. Kim
et al. propose PAR, a pattern-based repair approach via
common ways of fixing common bugs. The repair patterns
in their work are used to avoid nonsensical patches due to
the randomness of some mutation operators. Qi et al.
investigate the strength of random search in GenProg and show
that the random search (without genetic programming) based
repair method, RSRepair, can achieve even better performance
than GenProg. Kaleeswaran et al. propose MintHint, a
repair hint method for identifying expressions that are likely to
occur in patches, instead of fully automatically generating patches.
Mechtaev et al. [23] address the simplicity of generated patches
with a maximum satisfiability solver.


Barr et al. investigate the “plastic surgery” hypothesis in
genetic programming based repair like GenProg and show that
patches can be constructed via reusing existing code. Martinez
et al. [22] target the redundancy assumptions for existing code.
Tao et al. explore how to leverage machine-generated
patches to assist human debugging. Monperrus discusses
the problem statement and the evaluation criteria of software
repair. Zhong & Su [43] examine 9,000 real-world patches and
summarize 15 findings for fault localization and faulty code
fix in automatic repair. A recent study by Qi et al. shows
that only 2 out of 55 patches generated by GenProg and 2 of
the patches generated by RSRepair are correct; all the others
fail to produce the expected behaviors due to experimental issues
and weak test cases.


In previous work, we proposed Nopol, a repair
tool targeting buggy if conditions. In this paper, we leverage
Nopol as a tool in one application scenario of automatic
software repair, which investigates real-world if-condition bugs.


C. Automatic Analysis of Exception Handling 

Research on exception handling aims to analyze and enhance the
processing of software exceptions. Sinha & Harrold [33] propose
representation techniques with explicit exception occurrences
(explicitly via throw statements) and exception handling
constructs. Follow-up work by Sinha et al. [34] develops
a static and dynamic approach to analyzing implicit control
flows caused by exception handling.

Robillard & Murphy present the concept of exception-flow
information and design a tool that supports the extraction
and view of exception flows. Fu & Ryder develop a static
exception-chain analysis for the entire exception propagation
in programs. Zhang & Elbaum [42] study the faults associated
with exceptions that handle noisy resources and propose an
approach to amplifying the space of exceptional behavior
with external resources. Moreover, Bond et al. [4] present an
efficient origin tracking technique for null and undefined value
errors in the Java virtual machine and a memory-checking
tool. Mercadal et al. [25] propose an approach that relies on
an architecture description language, which is extended with
error-handling declarations.

In previous work [7], we proposed an approach to detect the
types of exception handling on nine Java projects. In this paper,
the approach in [7] serves as a platform to examine whether
our test case purification approach can improve the ability of
detecting exception contracts.


VIII. Conclusions 

This paper addresses test suite refactoring. We propose
B-Refactoring, a technique to split test cases into small fragments
in order to increase the efficiency of dynamic program analysis.
Our experiments on five open-source projects show that our
approach effectively improves the purity of test cases. We show
that applying B-Refactoring to existing analysis tasks, namely
repairing if-condition bugs and analyzing exception contracts,
enhances the capabilities of these tasks.

In future work, we plan to apply B-Refactoring to other
kinds of program analysis such as test suite prioritization.
Moreover, we will explore why developers design impure test
cases by analyzing and understanding existing tests in
open-source projects. We also plan to extend our implementation
of B-Refactoring to deal with more complex statements (e.g.,
loops) in test cases.


Appendix. Case Studies on Repairing Real-World Bugs


We evaluate our test suite refactoring technique on three
real-world bugs in Apache Commons Lang. Detailed descriptions
of these bugs can be found in Table VI. The first case
study can be found in Section V-A and the other two case
studies are as follows.


A. Case study 2 

A code snippet in Fig. 4a presents an if-condition bug with
ID 137552 in Apache Commons Lang. In Fig. 4a, the method
mid extracts a fixed-length substring from a given position.
The original if condition at Line 7 did not deal with the
condition of len < 0, which is expected to return an empty
string. In Fig. 4b, a test case testMid_String targets this
method. Three assertions are shown to explain the coverage of
branches. Two assertions at Line 5 and Line 14 cover the else
branch of the if condition while the other assertion at Line
10 covers the then branch. A repair approach, like Nopol,
cannot generate a patch for this bug because the test case
testMid_String covers both branches of the if condition
at Line 7 in the method mid.

We apply B-Refactoring to split the test case into four
test cases, as shown at Lines 4, 9, and 13 in Fig. 4b.
Such splitting can separate the coverage of then and else
branches; that is, each new test case covers only either the
then or else branch. Then the repair approach can generate
a patch, !(pos < str.length() && len >= 0),
which is equivalent to the manual patch at Line 6 in Fig. 4a.
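The equivalence claimed above can be checked with a small sketch. The class MidPatchSketch and its helper methods are hypothetical names introduced for illustration, and mid is a simplified stand-in for the method in Fig. 4a. The two conditions differ syntactically only when pos equals str.length(); in that case the unguarded path returns an empty string as well, so the observable behavior should be the same.

```java
// Illustrative sketch only; MidPatchSketch and its helpers are hypothetical
// names, and mid() is a simplified stand-in for StringUtils.mid in Fig. 4a.
class MidPatchSketch {

    // Condition of the manual patch (Line 6 in Fig. 4a).
    static boolean manualGuard(String str, int pos, int len) {
        return len < 0 || pos > str.length();
    }

    // Condition of the generated patch quoted in the text.
    static boolean generatedGuard(String str, int pos, int len) {
        return !(pos < str.length() && len >= 0);
    }

    // Body of mid() with the patched guard selected by a flag.
    static String mid(String str, int pos, int len, boolean useGenerated) {
        if (str == null) {
            return null;
        }
        boolean guard = useGenerated
                ? generatedGuard(str, pos, len)
                : manualGuard(str, pos, len);
        if (guard) {
            return "";
        }
        if (pos < 0) {
            pos = 0;
        }
        if (str.length() <= pos + len) {
            return str.substring(pos);
        }
        return str.substring(pos, pos + len);
    }

    public static void main(String[] args) {
        String s = "foobar";
        // Compare both variants on a small grid of positions and lengths.
        for (int pos = -2; pos <= 8; pos++) {
            for (int len = -2; len <= 8; len++) {
                String manual = mid(s, pos, len, false);
                String generated = mid(s, pos, len, true);
                if (!manual.equals(generated)) {
                    throw new AssertionError(
                            "mismatch at pos=" + pos + ", len=" + len);
                }
            }
        }
        System.out.println("both guards agree on all sampled inputs");
    }
}
```

The grid is only a sample of inputs on one string, so this is a sanity check rather than a proof.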


B. Case study 3 


This bug has ID 904093 in Apache Commons Lang.
Fig. 5a shows a code snippet with a buggy if condition at Line
12. In Fig. 5a, two methods getPackageName(Class)
and getPackageName(String) work on extracting the
package name of a class or a string. The original if condition
missed checking the empty string, i.e., the condition
of className.length() == 0. In Fig. 5b, two
test cases examine the behavior of these two methods. For
the first test case test_getPackageName_Class, we
present three assertions. We do not refactor this test case
because this test case is pure (the first and the third assertion
execute the else branch while the second assertion
does not execute any branch). For the second test
case test_getPackageName_String, two assertions are
shown. The first one is passing while the second is failing.
Thus, we split this test case into two test cases to distinguish
passing and failing test cases.


Based on B-Refactoring, we obtain three test cases, as
shown at Line 17 in Fig. 5b. Then the repair approach can
generate a patch as className.length() == 0. Note
that this patch is different from the real patch because the
condition className == null is ignored. The reason is
that in the original test suite, there exists no test case that
validates the then branch at Line 13.

To generate the same patch as the real patch at Line 9,
we manually add one test case test_manually_add at
Line 24 in Fig. 5b. This test case ensures the behavior of the
condition className == null. Based on this manually
added test case and the test cases by B-Refactoring, the repair
approach can generate a patch that is the same as the real one.
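The difference between the generated condition and the real patch can be illustrated with a minimal sketch. The class PackageNamePatchSketch, the guard parameter, and the helper outcome are hypothetical and introduced only for this illustration; the method body follows Fig. 5a, with the if condition passed in as a predicate so that the three variants can be compared on the inputs used by the test cases.

```java
import java.util.function.Predicate;

// Illustrative sketch only; the class, the guard parameter, and outcome() are
// hypothetical helpers. The method body follows Fig. 5a.
class PackageNamePatchSketch {

    // Three variants of the if condition at Line 12 in Fig. 5a.
    static final Predicate<String> BUGGY = s -> s == null;
    static final Predicate<String> GENERATED = s -> s.length() == 0;
    static final Predicate<String> REAL = s -> s == null || s.length() == 0;

    static String getPackageName(String className, Predicate<String> guard) {
        if (guard.test(className)) {
            return "";
        }
        while (className.charAt(0) == '[') {
            className = className.substring(1);
        }
        if (className.charAt(0) == 'L'
                && className.charAt(className.length() - 1) == ';') {
            className = className.substring(1);
        }
        int i = className.lastIndexOf('.');
        if (i == -1) {
            return "";
        }
        return className.substring(0, i);
    }

    // Returns the result, or a marker string if the call throws.
    static String outcome(String input, Predicate<String> guard) {
        try {
            return getPackageName(input, guard);
        } catch (RuntimeException e) {
            return "threw " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String[] inputs = {"java.util.Map$Entry", "", null};
        for (String input : inputs) {
            System.out.println("input=" + input
                    + " buggy=" + outcome(input, BUGGY)
                    + " generated=" + outcome(input, GENERATED)
                    + " real=" + outcome(input, REAL));
        }
    }
}
```

In this sketch the generated condition handles the empty string but throws on a null argument, which is the behavior that only the manually added test case exercises.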

Summary. In summary, we empirically evaluate our
B-Refactoring technique on three real-world if-condition bugs
from Apache Commons Lang. None of these three bugs can be
repaired by the repair approach, Nopol, with the original test
suite. The reason is that one test case covers both the then and
else branches. Then Nopol cannot decide which branch is covered
and cannot generate the constraint for this test case. With
B-Refactoring, we separate test cases into pure test cases that
cover only the then or else branch. Based on the test cases after
applying B-Refactoring, the first two bugs are fixed. The generated
patches are the same as the manually-written patches. For the
last bug, one test case was omitted by developers in the original
test suite. By adding this omitted test case, this bug can also
be fixed via the test suite after B-Refactoring.
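The notion of purity used in this summary can be sketched as follows. This is not the B-Refactoring implementation; PuritySketch, observed, and isPure are hypothetical names. The sketch records which outcomes of the buggy if condition of Fig. 4a a test fragment observes, and treats a fragment as pure when it observes at most one outcome.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; PuritySketch, observed, and isPure are hypothetical
// names and do not reflect the actual B-Refactoring implementation.
class PuritySketch {

    // Outcomes of the buggy if condition seen while running one fragment.
    static final Set<Boolean> observed = new HashSet<>();

    // Buggy mid() from Fig. 4a, instrumented at the if condition of Line 7.
    // Assumes str is non-null to keep the sketch short.
    static String mid(String str, int pos, int len) {
        boolean takesThen = pos > str.length();
        observed.add(takesThen);
        if (takesThen) {
            return "";
        }
        if (pos < 0) {
            pos = 0;
        }
        if (str.length() <= pos + len) {
            return str.substring(pos);
        }
        return str.substring(pos, pos + len);
    }

    // A fragment is pure if it observes at most one outcome of the condition.
    static boolean isPure(Runnable fragment) {
        observed.clear();
        fragment.run();
        return observed.size() <= 1;
    }

    public static void main(String[] args) {
        // The three assertions of Fig. 4b run together as one test case.
        Runnable whole = () -> {
            mid("foobar", 3, 1);
            mid("foobar", 9, 3);
            mid("foobar", -1, 3);
        };
        System.out.println("whole test pure: " + isPure(whole));
        // The same calls as separate fragments after splitting.
        System.out.println("fragment 1 pure: " + isPure(() -> mid("foobar", 3, 1)));
        System.out.println("fragment 2 pure: " + isPure(() -> mid("foobar", 9, 3)));
        System.out.println("fragment 3 pure: " + isPure(() -> mid("foobar", -1, 3)));
    }
}
```

The combined test observes both outcomes and is reported as impure, while each split fragment observes exactly one outcome.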
 1 String mid(String str, int pos, int len) {
 2   if (str == null)
 3     return null;
 4
 5   // PATCH:
 6   // if (len < 0 || pos > str.length())
 7   if (pos > str.length())
 8     return "";
 9
10   if (pos < 0)
11     pos = 0;
12   if (str.length() <= (pos + len))
13     return str.substring(pos);
14   else
15     return str.substring(pos, pos + len);
16 }

(a) Buggy program


 1 void testMid_String() {
 2   ...
 3
 4   // TODO: Split here
 5   assertEquals("b", StringUtils
 6     .mid(FOOBAR, 3, 1));
 7   ...
 8
 9   // TODO: Split here
10   assertEquals("", StringUtils
11     .mid(FOOBAR, 9, 3));
12
13   // TODO: Split here
14   assertEquals(FOO, StringUtils
15     .mid(FOOBAR, -1, 3));
16 }

(b) Test case


Fig. 4: Code snippets of a buggy program and a test case in Case study 2. The buggy if statement is at Line 7 in Fig. 4a while
the test case in Fig. 4b executes the then, the else, the then, and the else branches of the buggy statement, respectively.
Then B-Refactoring splits the test case into four test cases.


 1 String getPackageName(Class cls) {
 2   if (cls == null)
 3     return StringUtils.EMPTY;
 4   return getPackageName(cls.getName());
 5 }
 6
 7 String getPackageName(String className) {
 8
 9   // PATCH: if (className == null
10   //     || className.length() == 0)
11
12   if (className == null)
13     return StringUtils.EMPTY;
14   while (className.charAt(0) == '[')
15     className = className.substring(1);
16   if (className.charAt(0) == 'L' &&
17       className.charAt(className
18         .length() - 1) == ';')
19     className = className.substring(1);
20   int i = className.lastIndexOf(
21     PACKAGE_SEPARATOR_CHAR);
22   if (i == -1)
23     return StringUtils.EMPTY;
24   return className.substring(0, i);
25 }

(a) Buggy program


 1 void test_getPackageName_Class() {
 2   assertEquals("java.util", ClassUtils
 3     .getPackageName(Map.Entry.class));
 4   assertEquals("", ClassUtils
 5     .getPackageName((Class) null));
 6   assertEquals("java.lang", ClassUtils
 7     .getPackageName(String[].class));
 8   ...
 9 }
10
11 void test_getPackageName_String() {
12   ...
13   assertEquals("java.util", ClassUtils
14     .getPackageName(
15       Map.Entry.class.getName()));
16
17   // TODO: Split here
18   assertEquals("", ClassUtils
19     .getPackageName(""));
20 }
21
22 // Manually added test case
23 // to ensure the original condition
24 void test_manually_add() {
25   assertEquals("", ClassUtils
26     .getPackageName((String) null));
27 }

(b) Test cases


Fig. 5: Code snippets of a buggy program and test cases in Case study 3. The buggy if statement is at Line 12 of Fig. 5a while
the two test cases in Fig. 5b execute the then and else branches of the buggy statement. B-Refactoring splits the second test case
into two test cases and keeps the first test case. The last test case test_manually_add is manually added for explanation.